1. Introduction


A robotic system depends on a variety of on-board sensors providing information concerning its
environment in order to accomplish required mission objectives. Examples of typical mission
objectives for robotic systems are autonomous mobility and object detection. In general, most of
these sensors are capable of very accurately perceiving only a narrow aspect of the environment.
For example, ladar¹, radar, and sonar sensors provide depth and displacement information, while
infrared (IR) sensors provide data about thermal emissions within the environment. On the other
hand, machine vision systems using a “daylight” camera(s) can be one of the most informative
sensors, providing information across a wide range of sensor modalities (e.g., color, shading,
texture, etc.). As Bischoff and Graefe (1998) observed, “Vision is the most powerful sensor
modality for providing rich and timely information on a robot’s environment.” Unfortunately,
the versatility of the vision system information is often accompanied by the complexity of the
data analysis. Even a seemingly simple question such as the color of an observed object is
confounded by factors such as illumination (i.e., the color consistency problem), which often
lead to inconsistent results.

Faced with this disparity in the information provided by the sensors on board robotic systems, we
naturally exploit the possible benefits offered by the “integration” or “fusion” of data from
multiple sensors to construct a broader and more inclusive model of the robot’s environment.
While multi-sensor data fusion appears to be a common approach in the target recognition and
automatic target recognition communities (e.g., Hall, 1992; Stevens, Beveridge, and Goss, 1997),
much less work is reported in the literature about data fusion relative to constructing a model of
the environment of a robotic system. The work by Abidi and Gonzalez (1992) provides an
introduction to the subject. Of interest to us in this research is the fusion and integration of ladar
sensor data and imagery data from a stereo camera pair.

Output of a ladar sensor is range (distance) information based on the time of flight of a laser
pulse emitted by the sensor that is reflected off an object and back to the sensor. Thus, the range
information is a direct measure and is generally accurate. Figure 1 provides typical ladar range
data for real-time ladar sensors represented as an image using false color to quantify range. As
can be observed in the figure, the resolution of the range image is substantially less than that
available with most camera data. In addition, the ladar data suffer from what is known as the
“mixed point problem” (Dias, Sequeira, Goncalves, & Vaz, 2001). Essentially, the mixed point
problem results from the fact that the laser pulse has a non-zero width (i.e., appears more like a
disc than a point). At edges or depth discontinuities in the scene, the laser pulse reflects from
objects in both the foreground and background. In this case, the measured distance is a
combination of the distances to foreground and background objects. As a result, edges often
tend to exhibit sawtooth-like patterns, several instances of which are evident in figure 1. The
image of the same scene as viewed from the left camera of a stereo camera pair is shown in
figure 2. Clearly visible are many features not present in the ladar range image (e.g., shadows,
color, and clear boundaries at depth discontinuities). Camera data are a measurement of the
energy (intensity) of light reflected off object surfaces, which changes with the scene
illumination. Thus, most of the information obtained from camera data is the result of some
form of analysis. The most important derived information for a stereo camera pair is the
three-dimensional (3-D) reconstruction of the scene (i.e., world model) via a geometric analysis. For
relatively “simple” environments², both ladar and stereopsis tend to provide acceptable results.
However, the same is not necessarily true for more “complex environments”.

¹An acronym of laser detection and ranging, ladar uses laser light for detection of speed, altitude, direction and
range; it is often called laser radar. See the photonics dictionary web site: http://www.photonics.com/dictionary/.


Figure 1. False color ladar range image (Oberle & Haas, 2002).


Figure 2. Left-hand camera image from stereo camera pair of ladar
scene in figure 1 (Oberle & Haas, 2002).


²A “simple” environment is one in which there are few depth discontinuities and object surfaces tend to be
smooth.

To summarize, ladar data consist of accurate, spatially low-resolution range data while stereo
camera  data  consist  of  spatially  high  resolution  but  generally  noisy  reflectance  data  from  which 
range  data  can  be  derived.  In  addition,  the  stereo  camera  data  provide  scene  information  not 
available  with  the  ladar  (e.g.,  color  or  sharp  depth  discontinuities).  The  research  described  in 
this  report  focuses  on  improving  the  calculated  world  model  of  complex  environments  through 
the  use  of  data  integration  and  fusion  (Abidi  &  Gonzalez,  1992)  of  ladar  sensor  data  and  stereo 
camera  imagery.  Our  specific  research  objectives  are  to 

1. Improve the solution to the stereo correspondence problem³ and, by extension, the 3-D
stereo reconstruction⁴ problem by using data integration⁵ with ladar range data as a priori
disparity information; and

2. Improve the 3-D world model through the data fusion⁶ of the improved 3-D stereo
reconstruction information with ladar data.

A  number  of  researchers  have  used  data  fusion  of  ladar  and  vision  data  to  enhance  the  3-D  world 
model.  For  example,  Dias,  Sequeira,  Goncalves,  and  Vaz  (2001)  used  fusion  to  address  the 
mixed  point  problem  for  both  indoor  and  outdoor  environments.  Nickels,  Castano,  and  Cianci 
(2003) presented a unified architecture for fusing lidar⁷ and stereo range data to create a summary
map of obstacles and free space surrounding a robot. Spero and Jarvis (2002) detailed their
efforts to fuse imagery data and ladar to construct a high resolution model of the environment in
terms of surface shape and color (a common approach to obstacle detection and tracking in the
unmanned ground vehicle community; e.g., see Chang, Hong, Rasmussen, & Shneier, 2002).
However,  our  literature  review  yielded  no  research  addressing  the  use  of  the  ladar  data  as  a  priori 
information  to  improve  the  solution  of  the  stereo  correspondence  problem. 

The  purpose  of  this  report  is  to  describe  a  proposed  approach  to  accomplish  the  research 
objectives  as  enumerated  and  to  detail  the  current  status  of  the  research.  In  section  2,  a  proposed 
architecture  to  accomplish  the  stated  objectives  is  presented  and  discussed.  Section  3  describes 
relevant  characteristics  of  the  application  domain  (i.e.,  complex  environment)  and  their  influence 
on  the  selection  of  the  “stereo  matching”  algorithm  (used  to  solve  the  stereo  correspondence 
problem)  selected  for  the  initial  proof-of-concept  experiments.  Results  of  these  experiments  are 
provided  in  section  4.  Finally,  in  section  5,  a  summary  and  outline  of  future  work  are  presented. 


³Correspondence Problem: “Which parts of the left and right images are projections of the same scene element?”
(Trucco & Verri, 1998)

⁴Reconstruction Problem: “Given a number of corresponding parts of the left and right images, and possibly
information on the geometry of the stereo system, what can we say about the 3-D location and structure of the
observed object?” (Trucco & Verri, 1998)

⁵Synergistic use of sensor data to accomplish a specific task.

⁶Combining data to generate a single model representation.

⁷Another acronym of laser detection and ranging, same principle as ladar.
2. Architecture


A  flow  diagram  of  the  necessary  steps  to  accomplish  the  research  objectives  is  shown  in  figure  3. 
As  illustrated  in  the  figure,  the  processes  are  separated  into  five  different  functional  layers. 
Within  the  pre-processing  and  data  integration  layers,  individual  algorithms  necessary  to  achieve 
the  research  objectives  are  identified.  At  the  present  time,  a  number  of  the  algorithms  in  these 
layers  have  been  completed.  Much  less  specific  is  the  data  fusion  layer;  algorithms  in  this  layer 
will  augment  the  work  of  Dias  et  al.  (2001);  Nickels,  Castano,  and  Cianci  (2003);  Spero  and 
Jarvis  (2002);  Chang  et  al.  (2002),  and  others  as  we  continue  our  work.  A  discussion  of  the  first 
three  functional  layers  follows. 

2.1  Input  Layer 

The  ladar  and  camera  data  listed  as  input  in  the  input  layer  of  figure  3  represent  the  minimal 
input required to perform the analysis. Additional input that would be useful to the analysis
includes range or intensity images from the ladar and left-right camera registration. Sources of
input  error  that  can  propagate  throughout  the  analysis  involve  the  camera  calibration  information 
and  the  3-D  ladar  data.  On  a  moving  vehicle,  especially  over  rough  terrain,  vibrations  can  result 
in  changing  camera  settings  that  affect  the  camera  calibration.  Although  range  is  a  direct 
measurement  of  the  ladar  sensor,  3-D  coordinate  data  are  a  derived  measure.  Essentially,  the 
ladar system uses a spherical coordinate system in determining the 3-D coordinates. The
range, ρ, is directly measured while the two spherical coordinate angles, θ and φ, are associated
with the location of the laser emitter. For real-time systems, the angles and the emitter location
are often based on a pre-operation calibration.⁸ This calibration, as with the camera calibrations,
could change during periods of operation.
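
To make the distinction between the measured range and the derived 3-D coordinates concrete, the following is a minimal sketch of the spherical-to-Cartesian conversion. The angle convention (θ as azimuth in the x-y plane, φ as polar angle from the z-axis) and the function name are assumptions made for illustration; the actual convention depends on the ladar system.

```python
import math

def spherical_to_cartesian(rho, theta, phi):
    """Convert a range and two angles to (x, y, z).

    Assumed convention (not taken from the report): theta is the azimuth
    measured in the x-y plane from the x-axis, phi is the polar angle
    measured from the z-axis. Angles are in radians.
    """
    x = rho * math.sin(phi) * math.cos(theta)
    y = rho * math.sin(phi) * math.sin(theta)
    z = rho * math.cos(phi)
    return x, y, z

# Same measured range, but a 1-degree error in the calibrated azimuth:
nominal = spherical_to_cartesian(10.0, 0.0, math.pi / 2)
perturbed = spherical_to_cartesian(10.0, math.radians(1.0), math.pi / 2)
offset = math.dist(nominal, perturbed)  # displacement of the derived 3-D point
```

The example illustrates why a calibration drift matters: the range itself is unchanged, yet the derived 3-D point moves by an amount that grows with range.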

2.2  Pre-Processing  Layer 

The  pre-processing  layer  is  optional  if  the  ladar  sensor  and  cameras  are  registered  off  line  and  the 
registration  does  not  change  during  operation,  i.e.,  the  ladar  sensor  and  cameras  are  rigidly 
mounted  and  move  as  a  single  unit.  If  this  is  not  the  case,  then  three  registrations  (ladar-left 
camera,  ladar-right  camera,  and  left-right  camera)  are  required  in  order  to  complete  the  analysis. 
However,  given  any  two  of  the  registrations,  the  third  can  be  determined.  The  pre-processing 
layer  as  shown  assumes  that  the  cameras  are  not  registered.  If  the  cameras  are  registered,  then 
only  the  ladar-left  or  ladar-right  camera  registration  needs  to  be  determined.  In  either  case,  it  is 
necessary  to  determine  a  set(s)  of  matching,  corresponding,  or  homologous  points  between  the 
3-D  ladar  data  and  the  2-D  image  data  (left,  right,  or  both  cameras).  If  the  input  from  the  ladar 
sensor  includes  either  a  range  or  intensity  image,  the  image  can  be  used  in  the  matching. 

Otherwise, a range image for the ladar data must be created. A number of procedures can be
used to create the range image, the simplest being to use false color to quantify the range and to
use the scanning properties of the ladar (e.g., number of emissions per horizontal line, number of
vertical positions, etc.) to define the image size. Figure 1 is an example of this approach.

⁸Private communications, G. Haas, U.S. Army Research Laboratory, April 2004.
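
The simplest range-image procedure described above can be sketched as follows. The sketch assumes the scan is delivered as a flat, row-major list of ranges and uses a linear red-to-blue ramp as the false color; both the data layout and the color ramp are illustrative assumptions, not the procedure used to produce figure 1.

```python
def range_to_false_color(r, r_min, r_max):
    """Map a range to an (R, G, B) tuple: nearest = red, farthest = blue (assumed ramp)."""
    if r_max <= r_min:
        raise ValueError("r_max must exceed r_min")
    t = min(max((r - r_min) / (r_max - r_min), 0.0), 1.0)
    return (round(255 * (1.0 - t)), 0, round(255 * t))

def build_range_image(ranges, n_rows, n_cols):
    """Arrange a flat, row-major list of ranges into an n_rows x n_cols color image.

    n_cols plays the role of the number of emissions per horizontal line and
    n_rows the number of vertical positions.
    """
    if len(ranges) != n_rows * n_cols:
        raise ValueError("scan size does not match image size")
    r_min, r_max = min(ranges), max(ranges)
    return [
        [range_to_false_color(ranges[row * n_cols + col], r_min, r_max)
         for col in range(n_cols)]
        for row in range(n_rows)
    ]

# Hypothetical 2 x 3 scan (ranges in meters).
scan = [2.0, 2.0, 8.0, 2.0, 5.0, 8.0]
image = build_range_image(scan, n_rows=2, n_cols=3)
```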
Figure 3. Proposed architecture for research effort.
Once a ladar image (intensity or range) is available, details of the matching algorithm must be
addressed. The most direct approach is to manually select the corresponding points. Although
potentially time consuming, this approach should result in a relatively accurate set(s) of matching
points and is the approach used in this work. However, this approach is only applicable if the
registration(s) need to be calculated infrequently. If the ladar and cameras are not rigidly
mounted, the registration(s) will have to be performed on a continuous basis and the matching
process will have to be automated. Corners (Elstrom, 1998) or edges (Dias et al., 2001) are two
common features frequently used in the determination of corresponding points. However, since
one of the images is based on ladar range or intensity and the other is based on a camera, care
must be taken in using an automated evaluation of the “goodness” of the match. Standard
correlation techniques may not be viable in these circumstances.

Ladar-camera registration involves determining the rigid body transformation between the ladar
and camera coordinate systems based on a matched set of 3-D (ladar) and 2-D (camera)
coordinates. A comparison of approaches to accomplish this calculation has been performed
with several acceptable methods identified (Oberle & Haas, 2004). Two methods, one by
DeMenthon and Davis (1995) and another by Bouguet (2003), were used.
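
Because each registration is a rigid body transformation, the earlier observation that any two of the three registrations determine the third can be illustrated with a short sketch. It assumes each registration is stored as a pair (R, t) such that a camera-frame point equals R times the ladar-frame point plus t; this representation and the function names are assumptions for illustration only.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def left_to_right(ladar_to_left, ladar_to_right):
    """Derive the left-to-right camera registration from the two ladar-camera ones.

    With x_left = R_l x + t_l and x_right = R_r x + t_r, eliminating the
    ladar point x gives x_right = (R_r R_l^T) x_left + (t_r - R_r R_l^T t_l).
    """
    r_l, t_l = ladar_to_left
    r_r, t_r = ladar_to_right
    r_lr = matmul(r_r, transpose(r_l))  # inverse of a rotation is its transpose
    shift = matvec(r_lr, t_l)
    t_lr = [t_r[i] - shift[i] for i in range(3)]
    return r_lr, t_lr

# Hypothetical registrations: left camera rotated 90 degrees about z, right camera aligned.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
r_lr, t_lr = left_to_right((rot_z_90, [1.0, 2.0, 3.0]), (identity, [0.0, 0.0, 0.0]))

# The ladar point (1, 0, 0) is (1, 3, 3) in the left frame; map it to the right frame.
mapped = [a + b for a, b in zip(matvec(r_lr, [1.0, 3.0, 3.0]), t_lr)]
```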

2.3 Data Integration Layer

It is within the data integration layer that the major effort of our work has been focused to date.
The principal novelty of this work is the integration of 3-D ladar information as an a priori
image disparity map to improve the solution to the stereo correspondence problem.

The first step in creating the a priori disparity map from the 3-D ladar data is to project the ladar
data onto the left and right camera images while simultaneously building a table of left and right
image pixel pairs that are images of the same 3-D ladar point. A pin-hole, projective camera
model is used to perform the projections. As each 3-D ladar point is projected onto the left and
right camera images, a “partial mapping” between left-image pixels and right-image pixels is
generated. The mapping is termed partial since not every pixel in either the left or right image is
guaranteed to be in the range of the projection of the 3-D ladar points. Although the calculation
of the projection of a 3-D point onto an image is straightforward, the overall mapping must be
constructed in such a way to ensure that the resulting correspondence between pixels in the left
and right images is unique. Given two 3-D ladar points, there are four possible results for the
mapping as illustrated in figure 4. In cases 1 and 4, the correspondence between the image
pixels is unique. However, in cases 2 and 3, a single pixel in the left (right) image corresponds
to two pixels in the right (left) image. To resolve this ambiguity, the 3-D ladar point with the
greatest range and the associated pixel correspondence is discarded. The motivation for this
decision is based on the fact that if two 3-D points map to the same point in an image plane, only
the closest point is visible with the other point being occluded. Code to perform this particular
algorithm has been completed.
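
The projection and uniqueness rule described above can be sketched as follows. This is an illustrative reconstruction, not the completed code referred to in the text: the camera dictionary layout, the function names, and the strategy of visiting points in order of increasing range (so that a later, farther point is the one discarded) are all assumptions. Image-bounds checks are omitted for brevity.

```python
import math

def project(point, camera):
    """Project a 3-D ladar-frame point to an integer pixel with a pin-hole model.

    `camera` is a hypothetical dict holding the rotation R (3x3), translation t,
    focal length f in pixels, and principal point (cx, cy).
    Returns None when the point lies behind the camera.
    """
    R, t = camera["R"], camera["t"]
    xc = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0.0:
        return None
    u = camera["f"] * xc[0] / xc[2] + camera["cx"]
    v = camera["f"] * xc[1] / xc[2] + camera["cy"]
    return (round(u), round(v))

def build_correspondence_table(points, left_cam, right_cam):
    """Return a one-to-one table {left pixel: right pixel}.

    Points are processed nearest first; a point whose left or right pixel is
    already taken is dropped, so the point with the greater range is discarded.
    """
    table = {}
    used_right = set()
    for p in sorted(points, key=lambda q: math.hypot(*q)):
        left_px, right_px = project(p, left_cam), project(p, right_cam)
        if left_px is None or right_px is None:
            continue
        if left_px in table or right_px in used_right:
            continue  # a closer point already owns one of these pixels
        table[left_px] = right_px
        used_right.add(right_px)
    return table

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
left_cam = {"R": identity, "t": [0.1, 0.0, 0.0], "f": 100.0, "cx": 320.0, "cy": 240.0}
right_cam = {"R": identity, "t": [-0.1, 0.0, 0.0], "f": 100.0, "cx": 320.0, "cy": 240.0}

# Two points on the same left-camera ray: the farther one is occluded in the left image.
near_point, far_point = (0.0, 0.0, 2.0), (0.1, 0.0, 4.0)
table = build_correspondence_table([far_point, near_point], left_cam, right_cam)
```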
Figure 4. Possible outcomes of mapping two 3-D ladar points onto left and right camera images.

In  order  to  reduce  computation  time,  algorithms  designed  to  solve  the  stereo  correspondence 
problem  expect  the  stereo  image  data  to  be  rectified.  This  is  also  true  for  all  the  algorithms 
undergoing  consideration  for  use  in  this  work.  Thus,  the  next  step  in  computing  the  a  priori 
disparity  map  and  preparing  to  solve  the  correspondence  problem  is  rectification  of  the  image 
data  and  adjustment  of  the  associated  pixel  correspondences  determined  in  the  first  step  just 
discussed.  Essentially,  rectified  images  are  ones  in  which  the  same  scene  elements  are  on  the 
same  horizontal  scan  line  in  each  image.  If  the  cameras  of  the  stereo  camera  pair  are  calibrated 
and  registered,  then  rectification  can  be  accomplished  by  the  mapping  of  the  left  and  right 
images  to  new  images  whose  coordinate  systems  are  related  by  only  a  translation  along  the 
x-axis.  This  approach  is  described  in  almost  every  book  on  computer  vision  (e.g.,  Faugeras, 
1993,  or  Trucco  &  Verri,  1998).  Unfortunately,  if  the  camera  calibration  or  registration 
information  is  erroneous,  the  rectified  images  tend  to  be  vertically  shifted  (Oberle,  2004).  As 
mentioned  earlier,  because  of  vehicle  vibration,  the  camera  calibration  and/or  registration 
information  will  most  likely  change  during  operations.  Thus,  for  this  work,  a  rectification 
algorithm  not  dependent  on  either  camera  calibration  or  registration  information  will  be 
implemented.  The  algorithm  will  be  based  on  the  work  of  Lim,  Mittal,  and  Davis  (2004); 
Pollefeys,  Koch,  and  Van  Gool  (1999);  and  Loop  and  Zhang  (1999).  At  the  same  time  that  the 
images  are  rectified,  the  table  of  left-right  pixel  correspondences  is  adjusted  to  remain  consistent 
with  the  rectified  images. 
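
Keeping the correspondence table consistent with the rectified images can be sketched under the assumption that the rectification is expressed as a pair of 3x3 homographies, one per image. Not every rectification method has this form, so the representation and the function names below are assumptions for illustration.

```python
def apply_homography(h, pixel):
    """Map a pixel (u, v) through a 3x3 homography given as nested lists."""
    u, v = pixel
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return (x / w, y / w)

def adjust_table(table, h_left, h_right):
    """Re-express every left-right pixel pair in rectified image coordinates."""
    return {
        apply_homography(h_left, left_px): apply_homography(h_right, right_px)
        for left_px, right_px in table.items()
    }

# Hypothetical case: the right image only needs to be shifted up by two rows.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift_up_two = [[1.0, 0.0, 0.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
rectified = adjust_table({(10, 20): (7, 22)}, identity, shift_up_two)
```

After the adjustment, the two pixels of the pair lie on the same scan line, which is the property the rectified images are meant to have. The adjusted coordinates are generally non-integer and would need rounding before being used as pixel indices.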
Once the images are rectified and the table of left-right pixel correspondences is adjusted, the a
priori disparity map can be constructed. If $p_1 = (x_1, y_1)$ represents the coordinates (in pixels) of
a point in the left image and $p_2 = (x_2, y_2)$ represents the coordinates of the corresponding
point in the right image in the table, the disparity is defined as

$$
d(p_1, p_2) =
\begin{cases}
|x_1 - x_2|, & |y_1 - y_2| < \delta \\
0, & \text{otherwise}
\end{cases}
$$

The disparity map assigns to each pixel of the left-camera image the disparity with its
corresponding right image point (if one exists) from the table. If no corresponding pixel exists in
the table, a value of 0 is assigned; $\delta$ in the above definition represents a user-assigned parameter.
Appropriate values for $\delta$ remain to be experimentally determined.
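
A minimal sketch of the disparity definition and the resulting map follows. It assumes the table is a dictionary from integer left-image pixels (x, y) to right-image pixels and calls the user-assigned tolerance `delta`; the names and the data layout are illustrative assumptions.

```python
def disparity(p1, p2, delta):
    """Horizontal disparity of a pixel pair, or 0 if the rows differ by delta or more."""
    (x1, y1), (x2, y2) = p1, p2
    return abs(x1 - x2) if abs(y1 - y2) < delta else 0

def a_priori_disparity_map(table, width, height, delta):
    """Build a height x width map indexed as dmap[y][x] for the left image.

    Pixels without an entry in the table keep the value 0.
    """
    dmap = [[0] * width for _ in range(height)]
    for (x1, y1), p2 in table.items():
        if 0 <= x1 < width and 0 <= y1 < height:
            dmap[y1][x1] = disparity((x1, y1), p2, delta)
    return dmap

# Hypothetical table: the second pair is three rows apart and is therefore rejected.
table = {(3, 1): (1, 1), (2, 0): (0, 3)}
dmap = a_priori_disparity_map(table, width=4, height=2, delta=1.5)
```

Note that under this definition a value of 0 does not distinguish between "no ladar information" and "pair rejected by the tolerance test".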


The next stage of the data integration layer is to solve the stereo correspondence problem with
the a priori disparity map. Details concerning the initial algorithm selected for the
proof-of-concept experiments are provided in section 3 with results in section 4.

Once the stereo correspondence problem is solved, the final stage of the data integration layer
(stereo 3-D reconstruction) is performed. A standard geometric triangulation algorithm is used.
Details about the algorithm are given in Oberle and Haas (2002).
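
For orientation only, the textbook special case of triangulation for an ideal rectified stereo pair can be sketched as follows. This is not necessarily the algorithm detailed in Oberle and Haas (2002); the parameter names (focal length `f` in pixels, `baseline` in meters, principal point `cx`, `cy`) are assumptions.

```python
def triangulate_rectified(x, y, d, f, baseline, cx, cy):
    """Recover (X, Y, Z) in the left-camera frame from a left pixel and its disparity.

    Uses Z = f * baseline / d, valid only for ideal rectified cameras.
    Returns None when the disparity is not positive (no depth can be computed).
    """
    if d <= 0:
        return None
    z = f * baseline / d
    return ((x - cx) * z / f, (y - cy) * z / f, z)

# Hypothetical values: 500-pixel focal length, 0.2 m baseline, 10-pixel disparity.
point = triangulate_rectified(x=330, y=240, d=10, f=500.0, baseline=0.2, cx=320.0, cy=240.0)
```

Since depth is inversely proportional to disparity, a fixed disparity error produces a depth error that grows with distance.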


3.  Application  Domain  and  Stereo  Correspondence  Algorithm 


3.1  Application  Domain 

As mentioned in the introduction, we are predominantly concerned with scenes representing
complex environments. We define a complex environment as one in which there is a “large
number” of depth discontinuities. Generally, this implies that the scene contains a relatively
“large number” of individual objects at different depths. In addition, the individual objects will
tend to be rather “thin” (e.g., trees or poles) and are called “narrow occluding objects” (Brown,
Burschka, & Hager, 2003). An example of a complex environment is shown in figure 5.

In a stereo image pair, depth discontinuities result in occluded⁹ points, i.e., scene elements
visible in only one of the two images. This situation is illustrated in figure 6 where the portion of
the object highlighted in red is visible in only the right camera. Since the stereo correspondence
problem is already ill posed (Scharstein, Szeliski, & Zabih, 2001), occluded points only increase
the difficulty of obtaining an accurate solution. Besides creating occluded points, narrow
occluding objects create situations in which the “ordering constraint” is violated, further
complicating solutions to the correspondence problem. Many of the algorithms developed to
solve the correspondence problem at some point in their execution must choose between a
number of potential correspondences (e.g., a pixel in the left image, depending on the criteria
being used, may have several equally likely correspondences in the right image).

⁹Also referred to as half-occluded points.


Figure 5. Example of a complex environment.


To  aid  in  the  selection  of  the  correct  correspondence,  the  algorithms  employ  a  number  of  global 
constraints.  These  constraints  essentially  represent  prior  knowledge  concerning  the  scene. 
Several  common  constraints  are  smoothness  of  the  disparity  gradient,  left-right  consistency,  and 
the  ordering  constraint.  The  ordering  constraint  basically  assumes  that  image  points  will  occur 
in  the  same  order  in  both  images.  For  an  object  with  a  continuous  surface,  this  is  true,  even  if  the 
surface is not at a constant depth relative to the cameras. This situation is illustrated in figure 7.
However,  if  several  distinct  objects  (especially  if  the  objects  are  thin)  are  in  the  field  of  view  of 
the  cameras,  the  ordering  constraint  could  fail,  as  shown  in  figure  8  (red  and  yellow  points  have 
switched  order).  See  Dhond  and  Aggarwal  (1992)  for  additional  details  involving  the  ordering 
constraint  and  stereo  matching  in  the  presence  of  thin  occluding  objects. 
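
The failure of the ordering constraint for a thin foreground object can be checked numerically with a small sketch. It assumes ideal rectified cameras separated by a horizontal baseline and represents scene points by (X, Z) coordinates on one scan line; the geometry, names, and numbers are illustrative assumptions.

```python
def image_x(point, camera_x, focal=500.0):
    """Horizontal image coordinate of a scene point (X, Z) for a camera located at X = camera_x."""
    x, z = point
    return focal * (x - camera_x) / z

def scanline_matches(points, baseline=0.2):
    """Pair the left and right image coordinates of each scene point."""
    return [(image_x(p, 0.0), image_x(p, baseline)) for p in points]

def violates_ordering(matches):
    """True if points sorted by left-image position are out of order in the right image."""
    rights = [x_right for _, x_right in sorted(matches)]
    return any(later < earlier for earlier, later in zip(rights, rights[1:]))

# Two points on one continuous surface at slightly different depths.
wall = scanline_matches([(0.0, 5.0), (0.5, 6.0)])

# A thin pole 1 m away in front of a background point 10 m away.
pole_and_background = scanline_matches([(0.1, 1.0), (0.15, 10.0)])

wall_violation = violates_ordering(wall)
pole_violation = violates_ordering(pole_and_background)
```

In the second configuration the pole appears to the right of the background point in the left image but to its left in the right image, so any algorithm that enforces the ordering constraint cannot match both points correctly.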

In  summary,  our  desire  to  work  with  complex  environments  imposes  two  conditions  on  whatever 
algorithm  is  selected  to  solve  the  correspondence  problem.  First,  the  algorithm  must  be  robust  in 
terms  of  identifying  occluded  regions.  The  second  condition  is  that  the  algorithm  must  not  rely 
on  the  ordering  constraint  in  resolving  ties  between  possible  matches. 


3.2 Stereo Correspondence Algorithm

The development of stereo correspondence algorithms is one of the most active research areas
within the computer vision community. Thus, numerous stereo correspondence algorithms,
employing a variety of approaches, are available for use in our proof-of-concept experiments. In
the  end,  as  many  as  30  different  stereo  correspondence  algorithms  were  considered.  Besides  the 
two  conditions  stated  before,  other  important  considerations  in  the  final  algorithm  selection  are 
dense disparity maps¹⁰, accuracy, and execution time. Fortunately, researchers at Middlebury
College, Vermont (http://cat.middlebury.edu/stereo/; Scharstein & Szeliski, 2002; Scharstein,
Szeliski,  &  Zabih,  2001)  have  maintained  a  web  site  over  the  past  several  years  which  contains 
stereo  pairs  (non-complex  scenes  with  occlusions)  with  ground  truth  to  permit  the  comparison  of 
different  stereo  correspondence  algorithms.  The  results  compiled  by  the  Middlebury  College 
researchers  are  used  in  our  final  selection. 


¹⁰A dense disparity map assigns to almost every pixel in one image a corresponding pixel in the other image or
identifies the pixel as being occluded in the other image.


Following the taxonomy of Brown, Burschka, and Hager (2003), stereo correspondence
algorithms are classified as local methods or global methods. Local methods base matching
decisions  on  a  small  number  of  pixels  surrounding  a  given  pixel.  For  example,  matching 
depends  on  intensity  values  within  regularly  sized  neighborhoods  of  the  pixels  and  some  form  of 
similarity  (dis-similarity)  measure,  such  as  sum-of-squared  differences  or  census  metric  (Banks 
&  Corke,  2001;  Sebe,  Lew,  &  Huijsmans,  2000;  Scherer,  Werth,  &  Pinz,  1999;  Bhat  &  Nayar, 
1998),  is  used  to  establish  the  correspondences.  Global  methods  base  the  matching  decisions  on 
scan  lines  or  the  entire  image.  Dynamic  programming  algorithms  across  scan  lines  and  “graph 
cut”  algorithms  that  determine  the  disparity  map  for  the  entire  image  simultaneously  are 
examples of global methods.¹¹
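
As an illustration of a local method, the following sketch matches a single pixel by minimizing the sum-of-squared differences over a small square window along the same scan line. It assumes rectified grayscale images stored as nested lists and a window that stays inside both images; it is a bare-bones example of the idea, not one of the algorithms cited above.

```python
def ssd(left, right, x, y, d, half):
    """Sum-of-squared differences between the window at (x, y) in the left image
    and the window at (x - d, y) in the right image."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            diff = left[y + dy][x + dx] - right[y + dy][x + dx - d]
            total += diff * diff
    return total

def best_disparity(left, right, x, y, max_d, half=1):
    """Disparity in [0, max_d] with the smallest SSD (ties go to the smaller disparity).

    The caller must keep the left window inside the image; candidate
    disparities whose right window would leave the image are skipped.
    """
    candidates = [d for d in range(max_d + 1) if x - d - half >= 0]
    return min(candidates, key=lambda d: ssd(left, right, x, y, d, half))

# Synthetic test pair: the right image is the left image shifted by two pixels.
width, height, shift = 12, 5, 2
left = [[(7 * x * x + 13 * y) % 31 for x in range(width)] for y in range(height)]
right = [[left[y][x + shift] if x + shift < width else 0 for x in range(width)]
         for y in range(height)]
estimate = best_disparity(left, right, x=6, y=2, max_d=4)
```

Each pixel is decided independently from its own neighborhood, which is why such methods are fast but have difficulty at depth discontinuities, where a window straddles two surfaces.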

Local  method  algorithms  are  also  referred  to  as  window-,  area-,  or  correlation-based  algorithms. 
These  algorithms  represent  some  of  the  earliest  developed  to  solve  the  stereo  correspondence 
problem.  Algorithms  in  this  category  tend  to  execute  rapidly  and  form  the  basis  for  practically 
all  “real-time”  stereo  implementations  (Brown,  Burschka,  &  Hager,  2003;  Hirschmuller,  2001; 
Kimura,  Shinbo,  Yamaguchi,  Kawamura,  &  Nakano,  1999).  Although  most  of  the  local  methods 
produce  dense  disparity  maps,  those  methods  based  on  matching  features  (e.g.,  occlusion  edges, 
corners, or domain-specific features such as road surface markings) do not. On the other hand,
feature-based  methods  tend  to  be  less  sensitive  to  depth  discontinuities  than  other  local  methods. 
However,  as  Brown,  Burschka,  and  Hager  (2003)  state,  “Due  to  the  need  for  dense  depth  maps 
for  a  variety  of  applications  and  also  due  to  improvements  in  efficient  and  robust  block-matching 
methods,  interest  in  feature-based  methods  has  declined  in  the  last  decade.”  Because  of  the  lack 
of  dense  disparity  maps,  we  rejected  feature-matching  algorithms  for  this  work. 

Two  recent  local  method  implementations  that  do  not  rely  on  the  ordering  constraint  or  feature- 
based  matching  have  been  submitted  to  the  Middlebury  College  site  for  comparison  with  other 
methods.  Muhlmann,  Maier,  Hesser,  and  Manner  (2001)  developed  a  correlation-based  method 
using  a  median  filtering  to  remove  outliers  and  left-right  consistency  to  eliminate  false  matches 
to  generate  a  sub-pixel  accurate  disparity  map.  Although  efficient,  the  algorithm  is  ranked 
approximately  27th  of  the  30  algorithms  compared  on  the  Middlebury  College  site.  The  most 
recent  (April  15,  2004)  results  for  the  site  are  provided  in  appendix  A.  Hirschmuller  (2001)  also 
uses  a  correlation-based  method.  He  uses  a  novel  multiple  window  approach  and  a  border 
correction  filter  to  decrease  matching  errors  at  depth  discontinuities.  A  general  error  filter  is 
used to further invalidate uncertain matches. Although Hirschmuller’s algorithm produces
improved  results  for  the  Middlebury  College  comparisons,  it  still  ranks  approximately  17th. 
Based on these results, we made the decision not to use a local method.


¹¹Other global methods, some of which perform rather well in the Middlebury College comparisons, include
layered approaches (Baker, Szeliski, & Anandan, 1998; Shade, Gortler, He, & Szeliski, 1998), belief propagation
(Sun, Zheng, & Shum, 2003), and Markov random fields (Boykov, Veksler, & Zabih, 1998). Our analysis indicated
that these approaches were not the most suitable for our work.

Our analysis of global methods indicated that the best choice to achieve our objectives is a
graph-cut algorithm. Specific details concerning our choice are provided next. However,
numerous other global methods exist and have been evaluated (e.g., see footnote 11). One
method  that  is  often  used  in  stereo  correspondence  global  methods  is  dynamic  programming 
(Redert,  Tsai,  Hendriks,  &  Katsaggelos,  1998;  Tsai  &  Katsaggelos,  1999),  and  we  felt  that 
several  remarks  about  this  method  and  why  it  was  not  selected  are  warranted.  Cormen, 
Leiserson,  and  Rivest  (1990)  define  dynamic  programming  as  a  mathematical  method  that 
reduces  the  computational  complexity  of  an  optimization  problem  by  decomposing  it  into 
smaller  and  simpler  sub-problems.  Thus,  dynamic  programming  is  not  specific  to  stereovision. 
A  global  cost  function  across  scan  lines  is  computed  in  stages.  Going  from  one  stage  to  the  next 
is  determined  by  a  set  of  constraints.  One  of  the  necessary  constraints  is  the  ordering  constraint 
(Amini,  Weymouth,  &  Jain,  1990).  Since  one  of  our  conditions  for  the  stereo  correspondence 
algorithm  is  that  it  cannot  depend  on  the  ordering  constraint,  no  algorithm  using  dynamic 
programming  is  acceptable  for  our  work. 

Starting in the mid-1990s, a new global method approach to the stereo correspondence problem
was  developed,  based  on  the  minimization  of  an  “energy  function”  using  graph  cuts  (Boykov, 
Veksler,  &  Zabih,  2001(A),  2001(B);  Boykov  &  Kolmogorov,  2001;  Kolmogorov  &  Zabih, 
2001;  Kolmogorov  &  Zabih,  2002;  Kolmogorov,  Zabih,  &  Gortler,  2003).  Minimization  of  an 
energy  function  is  well  suited  to  our  situation.  It  is  reasonable  to  expect  that  the  solution  to  our 
correspondence  problem  will  not  vary  far  from  the  a  priori  ladar  disparity  information.  In 
addition,  graph-cut  algorithms  do  not  require  the  use  of  the  ordering  constraint.  Thus,  we  chose 
to  use  a  graph-cut  methodology  for  solving  the  stereo  correspondence  problem. 

Kolmogorov  and  Zabih  (2002)  describe  the  graph-cut  approach  as  “The  basic  technique  is  to 
construct  a  specialized  graph  for  the  energy  function  to  be  minimized,  such  (sic)  that  the 
minimum  cut  on  the  graph  also  minimizes  the  energy  (either  globally  or  locally).  The  minimum 
cut  in  turn  can  be  computed  very  efficiently  by  max  (sic)  flow  algorithms.”  Unfortunately,  as 
they  state,  “Minimizing  an  energy  function  via  graph  cuts,  however,  remains  a  technically 
difficult  problem.  Each  paper  constructs  its  own  graph  specifically  for  its  individual  energy 
function,  and  in  some  of  these  cases,  the  construction  is  fairly  complex.”  Since  our  goal  is 
directed  toward  data  integration  and  fusion,  not  the  development  of  a  graph-cut  algorithm,  we 
elected  to  modify  an  existing  graph-cut  algorithm.  An  algorithm  by  Kolmogorov  and  Zabih 
(2001)  described  in  Computing  Visual  Correspondence  with  Occlusions  Using  Graph  Cuts  was 
selected for its explicit handling of occlusions. Code for their algorithm is available online at
http://www.cs.cornell.edu/People/vnk/software.html.

The  basic  steps  for  using  a  graph-cut  algorithm  are 

1. Define the energy function,

2.  Construct  the  appropriate  graph,  and 

3.  Use  a  maximum  flow  algorithm  to  minimize  the  energy  function  via  graph  cuts. 
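
The three steps above can be sketched for the simplest case, a binary labeling problem, where a single minimum cut gives the exact minimum. This is only an illustration under stated assumptions: it is not the Kolmogorov and Zabih graph construction, it handles two labels rather than many disparities, and it substitutes a plain Edmonds-Karp augmenting-path search for the maximum flow algorithm used in their code. The names `binary_energy` and `min_cut_labels` and the sample costs are hypothetical.

```python
from collections import deque


def binary_energy(labels, unary, pairwise):
    """Step 1: E(x) = sum_i U_i(x_i) + sum_(i,j) w_ij * [x_i != x_j]."""
    e = sum(unary[i][x] for i, x in enumerate(labels))
    e += sum(w for (i, j), w in pairwise.items() if labels[i] != labels[j])
    return e


def min_cut_labels(unary, pairwise):
    """Steps 2 and 3: build the s-t graph, run max flow, read off the cut.

    unary:    list of (cost_of_label_0, cost_of_label_1) per node
    pairwise: dict {(i, j): w} with w >= 0
    """
    n = len(unary)
    s, t = n, n + 1
    cap = [dict() for _ in range(n + 2)]

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # reverse (residual) edge

    for i, (c0, c1) in enumerate(unary):
        add_edge(s, i, c1)  # cut when node i ends on the sink side (label 1)
        add_edge(i, t, c0)  # cut when node i stays on the source side (label 0)
    for (i, j), w in pairwise.items():
        add_edge(i, j, w)
        add_edge(j, i, w)

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            break  # no augmenting path left: the flow is maximal
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push

    # Nodes still reachable from the source take label 0, the rest label 1.
    labels = [0 if i in parent else 1 for i in range(n)]
    return labels, flow


unary = [(0, 5), (4, 3), (5, 0)]
pairwise = {(0, 1): 2, (1, 2): 2}
labels, flow = min_cut_labels(unary, pairwise)
print(labels, flow, binary_energy(labels, unary, pairwise))
```

For this construction the value of the maximum flow equals the energy of the returned labeling, which is the property that makes the minimum cut a minimizer of the energy.
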
As mentioned, the algorithm used in our work is a modification of the Kolmogorov and Zabih
(2001) algorithm. Our modification is the inclusion of an additional term in the energy function
representing  the  cost  or  penalty  of  assigning  a  disparity  to  a  pixel  different  from  that  assigned  by 
the  a  priori  ladar  disparity  data.  As  long  as  this  term  is  non-negative,  the  results  of  their 
algorithm  (i.e.,  computation  of  a  strong  local  minimum  for  the  energy  function)  remain  valid 
(Kolmogorov  &  Zabih,  2002). 

A  brief  description  of  the  modified  energy  function  using  the  notation  of  Kolmogorov  and  Zabih 
(2001)  is  provided.  This  illustrates  our  modification  of  the  original  Kolmogorov  and  Zabih 
energy  function.  For  details  concerning  the  construction  of  the  appropriate  graph  and  the  use  of 
a new maximum flow algorithm based on α-expansion (Boykov & Kolmogorov, 2001) to
minimize  the  energy  function,  the  reader  is  referred  to  Kolmogorov  and  Zabih  (2001). 

Notation:

$P$ = set of all pixels, i.e., pixels of the left image $\cup$ pixels of the right image.

$$A = \{\langle p,q\rangle \mid p \text{ and } q \text{ are pixels in different images}\},$$

i.e., a set of unordered pairs of pixels that could potentially correspond. An element of $A$ is
termed an "assignment."

$d(\langle p,q\rangle)$ = disparity between pixels $p$ and $q$.

$f$: assigns a 1 or 0 to every element (assignment) of $A$, referred to as a "configuration." An
assignment of $A$ is termed active if it is assigned a value of 1. Active assignments can be
thought of as pixels that correspond.

$A(f)$ = subset of $A$ consisting of active assignments according to the configuration $f$.

$N_p(f) = \{\langle p,q\rangle \in A(f)\}$, the set of active assignments in $f$ that involve pixel $p$.

Unique configuration $f$: $\forall p \in P,\ |N_p(f)| \le 1$, i.e., each pixel is involved in one active
assignment at most. Note that occluded pixels satisfy $|N_p(f)| = 0$.

$$T(\cdot) = \begin{cases} 1, & \text{argument true or non-zero} \\ 0, & \text{otherwise} \end{cases}$$

$$\mathcal{N} = \left\{ \{a_1,a_2\} \;\middle|\; a_1 \in A,\ a_2 \in A,\ d(a_1) = d(a_2), \text{ and if } a_1 = \langle p,q\rangle \text{ and } a_2 = \langle r,s\rangle \text{ with } p \text{ and } r \text{ in the left image, then } p \text{ and } r \text{ or } q \text{ and } s \text{ are adjacent pixels} \right\}$$

Footnote (on "strong local minimum"): Within a known factor of the global minimum.
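
One way to hold this notation in code is sketched below. It is an illustration only and makes assumptions the text does not state: the images are rectified so that candidate matches lie on the same row, disparity is the left column minus the right column, and each assignment is stored as an ordered (left pixel, right pixel) tuple even though the pairs are unordered in the notation. All function names are hypothetical.

```python
def candidate_assignments(width, height, max_disp):
    """The set A: every (p, q) with p in the left image, q in the right image."""
    return [((x, y), (x - d, y))
            for y in range(height)
            for x in range(width)
            for d in range(max_disp + 1)
            if x - d >= 0]


def disparity(a):
    """d(<p, q>): horizontal shift between the two pixels of an assignment."""
    (px, _), (qx, _) = a
    return px - qx


def active_for_pixel(active, pixel, side):
    """N_p(f): active assignments involving `pixel` (side 0 = left, 1 = right)."""
    return [a for a in active if a[side] == pixel]


def is_unique(active):
    """True if every pixel takes part in at most one active assignment."""
    left = [a[0] for a in active]
    right = [a[1] for a in active]
    return len(set(left)) == len(left) and len(set(right)) == len(right)


A = candidate_assignments(width=3, height=1, max_disp=1)
f_active = [((1, 0), (0, 0)), ((2, 0), (1, 0))]  # A(f) for one configuration
print(len(A), [disparity(a) for a in f_active], is_unique(f_active))
```

In this representation a pixel is occluded exactly when `active_for_pixel` returns an empty list for it.
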
Energy Function:

Employing the previous notation, our modification of the Kolmogorov and Zabih (2001) energy
function is written in general as

$$E(f) = E_{data}(f) + E_{occ}(f) + E_{smooth}(f) + E_{ladar}(f),$$

with our modification being the $E_{ladar}(f)$ term. The data term is the cost associated with an
assignment being identified as active and is given by

$$E_{data}(f) = \sum_{a \in A(f)} D(a),$$

in which for an assignment $a = \langle p,q\rangle$, $D(a) = (I(p) - I(q))^2$, with $I$ the intensity of the pixel.

The ladar term is the cost associated with an active assignment that has a disparity different from
the a priori ladar disparity data and is given by

$$E_{ladar}(f) = \sum_{a \in A(f)} (d(a) - L(p))^2 \cdot T(L(p)).$$

In the expression, $p$ is the left-image pixel of the assignment $a$, and $L(p)$ is the a priori ladar
disparity assigned to pixel $p$. Note that if no disparity is assigned to the pixel from the ladar
information, $L(p) = 0$ and no cost is incurred. Thus, the a priori ladar data only influence those
pixels that are in the range of the projection of the ladar data onto the left and right camera
images. In addition, if the resolution of the ladar data is low, $L(p)$ will equal 0 for most active
assignments. A major research effort of this work is to investigate the effect on the solution to
the correspondence problem resulting from different approaches for extending the a priori
disparity information to all active assignments.

The occlusion term imposes a cost for identifying a pixel as occluded. This term is given by

$$E_{occ}(f) = \sum_{p \in P} C_p \cdot T(|N_p(f)| = 0).$$

The value of $C_p$ is defined next. Finally, the smoothness term imposes a cost if adjacent pixels in
the same image do not have the same disparity. In terms of assignments, this is equivalent to
imposing a cost if one assignment is present in the configuration and another close assignment
with the same disparity is not. Specifically, the smoothness term is given by

$$E_{smooth}(f) = \sum_{\{a_1,a_2\} \in \mathcal{N}} V_{a_1,a_2} \cdot T(f(a_1) \ne f(a_2)).$$

Details of the function $V_{a_1,a_2}$ are provided next.

The goal is to determine the unique configuration, $f^*$, that minimizes $E(f)$. The solution to the
stereo correspondence problem follows from the minimizing configuration. Active assignments
identify corresponding pixels from which the disparity can be determined, while all pixels not
included in an active assignment are classified as occluded.
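
Reading the solution off a minimizing configuration can be sketched as follows. This is a minimal illustration, assuming that assignments are stored as ((x_left, y), (x_right, y)) tuples and that disparity is the left column minus the right column; the function name `decode_left` and the use of `None` to mark occluded pixels are hypothetical choices.

```python
def decode_left(active, width, height):
    """Return a row-major disparity map for the left image.

    Pixels that appear in no active assignment keep the value None,
    which marks them as occluded.
    """
    disp = [[None] * width for _ in range(height)]
    for (px, py), (qx, _qy) in active:
        disp[py][px] = px - qx
    return disp


active = [((1, 0), (0, 0)), ((3, 0), (1, 0))]
print(decode_left(active, width=4, height=1))
```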

To complete the description of the energy function, $C_p$ and $V_{a_1,a_2}$ must be defined. Let
$a_1 = \langle p,q\rangle$ and $a_2 = \langle r,s\rangle$ be two assignments with $p$ and $r$ in the same image. $C_p$ and $V_{a_1,a_2}$ are
then defined as

$$C_p = \lambda$$

and

$$V_{a_1,a_2} = \begin{cases} \lambda, & \text{if } \max(|I(p)-I(r)|,\ |I(q)-I(s)|) < 8 \\ 3\lambda, & \text{otherwise.} \end{cases}$$

The value of $\lambda$ is chosen empirically, or in the case of the Kolmogorov and Zabih implementation
that we use, $\lambda$ can also be automatically determined. We chose to allow the code to
automatically determine $\lambda$, since results presented by Kolmogorov and Zabih (2001) indicated
that their method is relatively insensitive to the specific choice of $\lambda$.
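
With every term now defined, the energy of a given configuration can be evaluated directly. The sketch below does this for a single scan line and is an illustration only, not the implementation used in this work. Its assumptions: pixels are integer columns, an assignment is a (left column, right column) pair, disparity is left minus right, the ladar term is squared as in the expression above, and the ladar data are a dictionary in which a missing or zero entry means no ladar information. The name `stereo_energy` and the sample data are hypothetical.

```python
def stereo_energy(active, all_assignments, left, right, ladar, lam):
    """Return (E_data, E_ladar, E_occ, E_smooth) for one scan line.

    active:          set of (p, q) pairs with f = 1
    all_assignments: list of every candidate (p, q) pair, i.e., the set A
    left, right:     lists of pixel intensities
    ladar:           dict {left column p: a priori disparity L(p)}
    lam:             the constant lambda
    """
    e_data = sum((left[p] - right[q]) ** 2 for p, q in active)

    e_ladar = 0
    for p, q in active:
        prior = ladar.get(p, 0)
        if prior != 0:  # T(L(p)): no cost where no ladar disparity exists
            e_ladar += ((p - q) - prior) ** 2

    matched_left = {p for p, _ in active}
    matched_right = {q for _, q in active}
    occluded = (len(left) - len(matched_left)) + (len(right) - len(matched_right))
    e_occ = lam * occluded  # C_p = lambda for every occluded pixel

    e_smooth = 0
    for i, (p, q) in enumerate(all_assignments):
        for r, s in all_assignments[i + 1:]:
            neighbors = (p - q) == (r - s) and abs(p - r) == 1
            if neighbors and ((p, q) in active) != ((r, s) in active):
                contrast = max(abs(left[p] - left[r]), abs(right[q] - right[s]))
                e_smooth += lam if contrast < 8 else 3 * lam

    return e_data, e_ladar, e_occ, e_smooth


left = [10, 10, 50, 50]
right = [10, 50, 50, 50]
A = [(p, p - d) for p in range(4) for d in range(2) if p - d >= 0]
terms = stereo_energy({(1, 0), (2, 1)}, A, left, right, {1: 1, 2: 2}, lam=2)
print(terms, sum(terms))
```

In the sample, the only ladar cost comes from left pixel 2, whose active assignment has disparity 1 while its a priori ladar disparity is 2.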


4.  Proof-of-Concept  Calculations 


Since  we  do  not  have  simultaneous  ladar  and  stereo  camera  data  supported  by  ground  truth  for 
scenes  from  complex  environments,  the  ground  truth  imagery  from  the  University  of  Tsukuba, 
Japan  (Scharstein  &  Szeliski,  2002,  2003)  is  used.  The  Tsukuba  imagery  used  is  the  left  and  right 
camera images (figure 9) and the ground truth images of disparity and occluded pixels (figure 10).


Figure  9.  Left  (left)  and  right  (right)  Tsukuba  stereo  images. 


Footnote: The University of Tsukuba imagery set is available online at the Middlebury College Vision web site. Y. Ohta
and Y. Nakamura of the University of Tsukuba supplied the imagery data set.
The images are 384 by 288 pixels. The ground truth images (figure 10) have a border of 18
pixels  in  which  there  is  no  information.  In  the  disparity  image  (left  side  of  figure  10),  this  is  the 
black  border. 

For  the  proof-of-concept  calculations,  the  a  priori  ladar  disparity  data  are  taken  to  be  the  ground 
truth  disparity  data  (left  side  of  figure  10).  Some  differences  exist  between  the  ground  truth 
disparity  data  and  what  would  be  expected  from  actual  a  priori  ladar  disparity  data.  The 
18-pixel  border  of  the  ground  truth  disparity  data  will  be  incorrectly  interpreted  as  occluded 
when  treated  as  the  a  priori  ladar  data.  In  addition,  the  ground  truth  disparity  data  has  been 
“filled  in”  so  that  all  pixels  are  assigned  a  disparity  even  if  the  pixel  is  actually  occluded  (i.e.,  the 
pixels  identified  as  occluded  in  the  image  in  the  right  of  figure  10  are  assigned  disparities  in  the 
image  in  the  left  of  figure  10).  Results  of  calculations  with  and  without  the  use  of  the  a  priori 
ladar disparity data are presented in table 1. Comparisons are made between the ground truth
information and the calculated results.


Figure  10.  Ground  truth  for  Tsukuba  imagery,  disparity  (left)  and  occlusions  (right). 


Table 1. Results of proof-of-concept calculations.

| Calculation | Percentage of Pixels Whose Disparity Is Correctly Labeled (pixels labeled as occluded by both the calculation and ground truth ignored) | Percentage of Occluded Pixels Correctly Labeled | Percentage of Pixels Incorrectly Labeled as Occluded |
|---|---|---|---|
| Calculation with a priori ladar disparity data | 92.955 | 77.4 | 1.40 |
| Calculation without a priori ladar disparity data | 92.343 | 67.6 | 1.45 |
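
The three quantities in table 1 could be computed along the following lines. The report does not spell out the denominators, so this sketch assumes them: the first percentage is taken over all pixels except those labeled occluded by both the calculation and the ground truth, the second over the ground-truth occluded pixels, and the third over all pixels. The function name `table1_metrics`, the flat-list image layout, and the sample data are hypothetical, and the sketch expects at least one ground-truth occluded pixel.

```python
def table1_metrics(calc_disp, gt_disp, gt_occluded):
    """Return the three table 1 percentages for flat lists of equal length.

    calc_disp:   calculated disparity per pixel, None = labeled occluded
    gt_disp:     ground truth disparity per pixel
    gt_occluded: ground truth occlusion flag per pixel
    """
    n = len(calc_disp)
    calc_occ = [d is None for d in calc_disp]

    # Ignore pixels that both the calculation and the ground truth call occluded.
    considered = [i for i in range(n) if not (calc_occ[i] and gt_occluded[i])]
    correct = sum(calc_disp[i] == gt_disp[i] for i in considered)

    occ_total = sum(gt_occluded)
    occ_hit = sum(c and g for c, g in zip(calc_occ, gt_occluded))
    false_occ = sum(c and not g for c, g in zip(calc_occ, gt_occluded))

    return (100.0 * correct / len(considered),
            100.0 * occ_hit / occ_total,
            100.0 * false_occ / n)


calc = [5, 5, None, None, 7, 3]
truth = [5, 5, 6, 6, 7, 4]
truth_occ = [False, False, True, False, False, False]
metrics = table1_metrics(calc, truth, truth_occ)
print(metrics)
```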

Based on the results of the Middlebury College comparisons (appendix A), the Kolmogorov and
Zabih algorithm ranks either first or second in terms of its performance for the Tsukuba image pair. Thus,
the  small  increase  in  the  percentage  of  correctly  labeled  pixels  using  the  a  priori  ladar  disparity 
data  is  encouraging.  More  encouraging  is  the  improvement  in  the  results  related  to  occluded 
pixels  for  the  calculation  using  the  a  priori  ladar  disparity  data.  As  discussed  before,  the  a  priori 
ladar disparity data used in the calculation provided erroneous occlusion information, yet the
calculation with the a priori ladar disparity data correctly identified approximately 15% (77.4%
versus 67.6%) more occluded pixels compared to the calculation without the a priori ladar
disparity data. Those pixels identified as occluded in both calculations are shown in figure 11.
Results for the calculation with the a priori ladar disparity data are shown on the left of
figure 11, and results for the calculation without the a priori ladar disparity data are shown on
the right. The results for the calculation with the a priori ladar disparity data (left side of
figure 11) appear for the most part to be cleaner (e.g., areas inside red circles) than the results
without the a priori ladar disparity data (right side of figure 11).


Figure  11.  Occluded  pixels  for  calculation  with  (left)  and  without  (right)  a  priori  ladar  disparity  data. 


In addition, the calculation with the a priori ladar disparity data incorrectly labeled roughly 3.5% (1.40%
versus 1.45%) fewer pixels as occluded compared to the other calculation.
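
As a check on the two relative figures quoted for the occlusion results, both follow from the table 1 entries when the difference is taken relative to the appropriate baseline (the calculation without the a priori ladar disparity data):

```latex
\frac{77.4 - 67.6}{67.6} = \frac{9.8}{67.6} \approx 0.145 \quad (\text{approximately } 15\%\ \text{more occluded pixels correctly labeled}),
\qquad
\frac{1.45 - 1.40}{1.45} = \frac{0.05}{1.45} \approx 0.034 \quad (\text{roughly } 3.5\%\ \text{fewer pixels incorrectly labeled as occluded}).
```

In absolute terms the differences are 9.8 and 0.05 percentage points, respectively.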

Improved results for the calculation with the a priori ladar disparity data are also evident in an
analysis of the pixels that were incorrectly labeled. The maximum disparity that could be
assigned to a pixel for the calculations is 15. For both calculations (with and without a priori
ladar disparity data) the largest difference between any calculated disparity and the ground truth
disparity is 13. As illustrated in figure 12, the errors in the disparity assignments for the
calculation with a priori ladar disparity data are generally smaller, compared to the calculation
without a priori ladar disparity data, with no disparity error greater than 6 compared to 13 for the
other calculation.
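
A curve of the kind plotted in figure 12 could be produced as sketched below. The exact definition behind the figure is not given in the text, so this is an assumption: for each error level k, the value is the percentage of all pixels whose absolute disparity error lies between 1 and k, with pixels labeled occluded by the calculation left out of the error count. The function name `cumulative_error_percent` and the sample data are hypothetical.

```python
def cumulative_error_percent(calc_disp, gt_disp, max_error):
    """Percent of total pixels in error by at most k, for k = 1..max_error.

    calc_disp: calculated disparity per pixel, None = labeled occluded
    gt_disp:   ground truth disparity per pixel
    """
    n = len(calc_disp)
    errors = [abs(c - g) for c, g in zip(calc_disp, gt_disp) if c is not None]
    return [100.0 * sum(1 for e in errors if 1 <= e <= k) / n
            for k in range(1, max_error + 1)]


calc = [5, 5, None, 8, 7, 3, 2, 2]
truth = [5, 5, 6, 6, 7, 4, 2, 5]
curve = cumulative_error_percent(calc, truth, max_error=3)
print(curve)
```

A curve that flattens at a small error level, as reported for the calculation with a priori ladar disparity data, indicates that the mislabeled pixels are wrong by only a few disparity levels.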

Finally, a series of calculations was performed in which "white noise" was added to the left and
right stereo images. As expected, the results degraded, and the difference between the
calculations with and without the a priori ladar disparity data increased as the severity of the noise
increased. Details of these calculations are not provided.
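
Because the noise model is not described, the following is only one plausible way such an experiment could be set up, not a record of what was done. It assumes zero-mean Gaussian noise added independently to each pixel of an 8-bit grayscale image stored as a list of rows, with the standard deviation `sigma` playing the role of the noise severity. The function name `add_white_noise` and the sample image are hypothetical.

```python
import random


def add_white_noise(image, sigma, seed=0):
    """Return a copy of `image` with zero-mean Gaussian noise added per pixel.

    image: list of rows of integer intensities in 0..255
    sigma: standard deviation of the noise (assumed measure of severity)
    seed:  seed for the random generator, so runs are repeatable
    """
    rng = random.Random(seed)
    return [[min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in row]
            for row in image]


clean = [[0, 64, 128], [192, 255, 32]]
noisy = add_white_noise(clean, sigma=10.0, seed=1)
print(noisy)
```

Using a different seed for the left and right images would keep the noise in the two views independent.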

Based on the overall results for the different calculations, it appears that improvements in the
solution of the stereo correspondence problem can result from the use of a priori ladar disparity
data. Improvements relative to the solution without the a priori ladar disparity data are observed
in the number of correctly labeled pixels (disparity and occlusions) and a reduction in the
magnitude of the error in the disparity for those pixels that are incorrectly labeled.
[Figure 12 plot. Legend: Without A Priori Ladar Disparity Data; With A Priori Ladar Disparity Data.]


Figure  12.  Cumulative  percentage  of  total  pixels  in  error  versus  error  in  disparity. 


5.  Summary  and  Future  Work 


In  this  report,  we  described  an  approach  and  architecture  to  incorporate  data  integration  and 
fusion  of  ladar  sensor  data  and  stereo  camera  imagery  to  produce  high  resolution,  ladar-quality 
3-D  world  models.  Of  particular  interest  is  the  construction  of  world  models  for  scenes 
involving  complex  environments — a  situation  that  is  extremely  difficult  for  traditional  stereo 
algorithms  because  of  the  large  number  of  occluded  regions.  As  stated  earlier  in  the  report,  the 
principal  novelty  of  our  work  is  the  integration  of  3-D  ladar  information  as  an  a  priori  disparity 
map to improve the solution to the stereo correspondence problem. Proof-of-concept
calculations  were  performed  with  a  modified  energy  function  based  upon  the  work  of 
Kolmogorov  and  Zabih.  The  approach  uses  recently  developed  algorithms  for  computer  vision 
incorporating  minimum  cut-maximum  flow  paradigms.  Our  results  indicated  that  data 
integration  of  ladar  data  as  an  a  priori  disparity  map  and  stereo  data  can  produce  improvements 
in  the  solution  to  the  correspondence  problem.  Of  particular  note  is  the  improvement  in  the 
identification  of  occluded  regions  with  the  data  integration. 

Footnote (on "proof-of-concept"): Data integration can improve the solution to the stereo correspondence problem.

Although we describe a detailed architecture, a number of the necessary algorithms in the data
integration and fusion layers have not been developed. This is especially true for the data fusion
layer. However, in this layer, we expect to draw heavily on the many efforts involving data
fusion described in the vision literature. Near-term future work will be directed toward
completing the algorithms of the data integration layer. We hope that this will include an
evaluation of additional stereo correspondence algorithms specifically developed for complex
environments.

Appendix A. Results from Middlebury College Stereo Vision Comparison
(April 15, 2004) (http://cat.middlebury.edu/stereo/)


Welcome  to  the  Middlebury  Stereo  Vision  Page 

This  web  site  contains  material  accompanying  our  taxonomy  and  experimental  comparison  of 
stereo  correspondence  algorithms  [1].  It  contains  stereo  data  sets  with  ground  truth,  the  overall 
comparison  of  algorithms,  instructions  on  how  to  evaluate  your  stereo  algorithm  in  our 
framework,  and  our  stereo  correspondence  software. 

Also  available  are  two  new  stereo  data  sets  with  ground  truth  obtained  using  our  structured 
lighting  technique  [2].  These  data  sets  have  a  more  complex  geometry  and  larger  disparity 
ranges  than  the  original  data  sets. 

We  are  continually  inviting  other  researchers  to  run  their  stereo  algorithms  on  the  four  image 
pairs  used  in  our  overall  comparison,  and  to  send  us  the  results.  We  will  then  run  our  evaluator, 
and  report  the  resulting  disparity  error  statistics.  If  you  are  interested  in  participating,  please  go 
to  the  evaluation  page. 

How  to  Cite  the  Materials  on  This  Web  Site: 

We  grant  permission  to  use  and  publish  all  images  and  numerical  results  on  this  website. 
However,  if  you  use  our  data  sets,  and/or  report  performance  results,  we  request  that  you  cite  the 
appropriate  paper(s)  [1,2].  If  you  want  to  cite  this  website,  please  use  the  “stable”  URL 
“www.middlebury.edu/stereo”.  (This  URL  is  currently  auto-forwarded  to 
“cat.middlebury.edu/stereo”,  but  that  may  change.) 
Algorithms.
pages 195-202, Madison, WI, June 2003.

Support  for  this  work  was  provided  in  part  by  NSF  CAREER  grant  9984485.  Any  opinions,  findings,  and  conclusions  or 
recommendations  expressed  in  this  material  are  those  of  the  authors  and  do  not  necessarily  reflect  the  views  of  the  National 
Science  Foundation. 
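
Table A-1. Comparison of the performance of different stereo algorithms on four test image pairs

Each entry is followed, in parentheses, by the rank of the algorithm in that column.

| Algorithm | Tsukuba all | Tsukuba untex. | Tsukuba disc. | Sawtooth all | Sawtooth untex. | Sawtooth disc. | Venus all | Venus untex. | Venus disc. | Map all | Map disc. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Segm.-based GC [23] | 1.23 (3) | 0.29 (2) | 6.94 (4) | 0.30 (3) | 0.00 (1) | 3.24 (3) | 0.08 (1) | 0.01 (1) | 1.39 (1) | 1.49 (19) | 15.46 (24) |
| Segm.+glob.vis. [25] | 1.30 (5) | 0.48 (5) | 7.50 (6) | 0.20 (1) | 0.00 (1) | 2.30 (1) | 0.79 (4) | 0.81 (5) | 6.37 (7) | 1.63 (21) | 16.07 (26) |
| Layered [16] | 1.58 (7) | 1.06 (9) | 8.82 (8) | 0.34 (4) | 0.00 (1) | 3.35 (4) | 1.52 (10) | 2.96 (19) | 2.62 (3) | 0.37 (10) | 5.24 (10) |
| Belief prop. [3] | 1.15 (1) | 0.42 (3) | 6.31 (1) | 0.98 (10) | 0.30 (14) | 4.83 (8) | 1.00 (6) | 0.76 (4) | 9.13 (13) | 0.84 (16) | 5.27 (11) |
| MultiCam GC [21] | 1.85 (10) | 1.94 (15) | 6.99 (5) | 0.62 (8) | 0.00 (1) | 6.86 (12) | 1.21 (8) | 1.96 (10) | 5.71 (6) | 0.31 (7) | 4.34 (9) |
| Region-Progress. [24] | 1.44 (6) | 0.55 (6) | 8.18 (7) | 0.24 (2) | 0.00 (1) | 2.64 (2) | 0.99 (5) | 1.37 (8) | 6.40 (8) | 1.49 (20) | 17.11 (27) |
| GC+occl. [2b] | 1.19 (2) | 0.23 (1) | 6.71 (2) | 0.73 (9) | 0.11 (9) | 5.71 (10) | 1.64 (13) | 2.75 (17) | 5.41 (5) | 0.61 (13) | 6.05 (12) |
| Improved Coop. [19] | 1.67 (8) | 0.77 (7) | 9.67 (11) | 1.21 (13) | 0.17 (12) | 6.90 (13) | 1.04 (7) | 1.07 (6) | 13.68 (18) | 0.29 (5) | 3.65 (6) |
| GC+occl. [2a] | 1.27 (4) | 0.43 (4) | 6.90 (3) | 0.36 (5) | 0.00 (1) | 3.65 (5) | 2.79 (21) | 5.39 (22) | 2.54 (2) | 1.79 (22) | 10.08 (18) |
| Disc. pres. [18] | 1.78 (9) | 1.22 (11) | 9.71 (12) | 1.17 (12) | 0.08 (8) | 5.55 (9) | 1.61 (12) | 2.25 (13) | 9.06 (12) | 0.32 (8) | 3.33 (5) |
| Symbiotic [20] | 2.87 (15) | 1.71 (14) | 11.90 (13) | 1.04 (11) | 0.13 (10) | 7.32 (15) | 0.51 (2) | 0.23 (2) | 7.88 (10) | 0.50 (12) | 6.54 (13) |
| Graph cuts [1a] | 1.94 (12) | 1.09 (10) | 9.49 (10) | 1.30 (15) | 0.06 (7) | 6.34 (11) | 1.79 (16) | 2.61 (16) | 6.91 (9) | 0.31 (6) | 3.88 (7) |
| Var. win. [17] | 2.35 (13) | 1.65 (13) | 12.17 (15) | 1.28 (14) | 0.23 (13) | 7.09 (14) | 1.23 (9) | 1.16 (7) | 13.35 (17) | 0.24 (3) | 2.98 (3) |
| Graph cuts [5] | 1.86 (11) | 1.00 (8) | 9.35 (9) | 0.42 (6) | 0.14 (11) | 3.76 (6) | 1.69 (15) | 2.30 (14) | 5.40 (4) | 2.39 (25) | 9.35 (16) |
| Multiw. cut [13] | 8.08 (27) | 6.53 (24) | 25.33 (28) | 0.61 (7) | 0.46 (17) | 4.60 (7) | 0.53 (3) | 0.31 (3) | 8.06 (11) | 0.26 (4) | 3.27 (4) |
| Comp. win. [4] | 3.36 (18) | 3.54 (18) | 12.91 (18) | 1.61 (18) | 0.45 (16) | 7.87 (16) | 1.67 (14) | 2.18 (11) | 13.24 (16) | 0.33 (9) | 3.94 (8) |
| Realtime [7] | 4.25 (22) | 4.47 (22) | 15.05 (22) | 1.32 (16) | 0.35 (15) | 9.21 (17) | 1.53 (11) | 1.80 (9) | 12.33 (14) | 0.81 (15) | 11.35 (21) |
| Cooperative [6] | 3.49 (19) | 3.65 (19) | 14.77 (20) | 2.03 (19) | 2.29 (23) | 13.41 (22) | 2.57 (20) | 3.52 (20) | 26.38 (27) | 0.22 (2) | 2.37 (1) |
| Bay. diff [1b] | 6.49 (26) | 11.62 (29) | 12.29 (16) | 1.45 (17) | 0.72 (18) | 9.29 (18) | 4.00 (23) | 7.21 (25) | 18.39 (22) | 0.20 (1) | 2.49 (2) |
| Stoch. diff [9] | 3.95 (20) | 4.08 (21) | 15.49 (24) | 2.45 (23) | 0.90 (20) | 10.58 (19) | 2.45 (18) | 2.41 (15) | 21.84 (24) | 1.31 (18) | 7.79 (15) |
| Genetic [11] | 2.96 (16) | 2.66 (17) | 14.97 (21) | 2.21 (21) | 2.76 (25) | 13.96 (23) | 2.49 (19) | 2.89 (18) | 23.04 (25) | 1.04 (17) | 10.91 (20) |
| SSD+MF [1c] | 5.23 (25) | 3.80 (20) | 24.66 (27) | 2.21 (20) | 0.72 (19) | 13.97 (24) | 3.74 (22) | 6.82 (24) | 12.94 (15) | 0.66 (14) | 9.35 (16) |
| Max flow [14] | 2.98 (17) | 2.00 (16) | 15.10 (23) | 3.47 (24) | 3.00 (26) | 14.19 (25) | 2.16 (17) | 2.24 (12) | 21.73 (23) | 3.13 (26) | 15.98 (25) |
| Pix-to-pix [12] | 5.12 (24) | 7.06 (27) | 14.62 (19) | 2.31 (22) | 1.79 (21) | 14.93 (26) | 6.30 (26) | 11.37 (28) | 14.57 (19) | 0.50 (11) | 6.83 (14) |
| Scanl. opt. [1e] | 5.08 (23) | 6.78 (25) | 11.94 (14) | 4.06 (25) | 2.64 (24) | 11.90 (20) | 9.44 (29) | 14.59 (29) | 18.20 (21) | 1.84 (23) | 10.22 (19) |
| Dyn. prog. [1d] | 4.12 (21) | 4.63 (23) | 12.34 (17) | 4.84 (28) | 3.71 (28) | 13.26 (21) | 10.10 (30) | 15.01 (30) | 17.12 (20) | 3.33 (27) | 14.04 (23) |
| Realtime DP [26] | 2.85 (14) | 1.33 (12) | 15.62 (25) | 6.25 (30) | 3.98 (29) | 25.19 (28) | 6.42 (27) | 8.14 (26) | 25.30 (26) | 6.45 (29) | 25.16 (28) |
| MMHM [15] | 9.76 (29) | 13.85 (30) | 24.39 (26) | 4.76 (27) | 1.87 (22) | 22.49 (27) | 6.48 (28) | 10.36 (27) | 31.29 (28) | 8.42 (30) | 12.68 (22) |
| Shao [8] | 9.67 (28) | 7.04 (26) | 35.63 (29) | 4.25 (26) | 3.19 (27) | 30.14 (30) | 6.01 (25) | 6.70 (23) | 43.91 (30) | 2.36 (24) | 33.01 (30) |
| Max. surf [10] | 11.10 (30) | 10.70 (28) | 41.99 (30) | 5.51 (29) | 5.56 (30) | 27.39 (29) | 4.36 (24) | 4.78 (21) | 41.13 (29) | 4.17 (28) | 27.88 (29) |
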
Our implementation:

[1] D. Scharstein and R. Szeliski. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence
Algorithms. IJCV 2002. Five algorithms have been implemented:

a - Graph cuts using alpha-beta swaps (Boykov, Veksler, and Zabih, PAMI 2001);
b - Bayesian diffusion (Scharstein and Szeliski, IJCV 1998);
c - SSD + min-filter (i.e., shiftable windows), window size = 21;
d - Dynamic programming, similar to Bobick and Intille (IJCV 1999);
e - Scanline optimization (1D optimization using horizontal smoothness terms).

Other authors' implementations:

[2] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions using graph cuts. ICCV 2001.

a - original submission

b - new submission with automatic parameter setting (same as in [21])

[3] J. Sun, H. Y. Shum, and N. N. Zheng. Stereo matching using belief propagation. PAMI 2003 (also in ECCV 2002).

[4] O. Veksler. Stereo matching by compact windows via minimum ratio cycle. ICCV 2001.

[5] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI 2001.

[6] L. Zitnick and T. Kanade. A cooperative algorithm for stereo matching and occlusion detection. PAMI 2000.

[7] H. Hirschmuller. Improvements in Real-Time Correlation-Based Stereo Vision. CVPR 2001 Stereo Workshop
/ IJCV 2002.

[8] J. Shao. Combination of Stereo, Motion and Rendering for 3D Footage Display. CVPR 2001 Stereo Workshop
/ IJCV 2002.

[9] S. H. Lee, Y. Kanatsugu, and J.-I. Park. Hierarchical stochastic diffusion for disparity estimation. CVPR 2001
Stereo Workshop / IJCV 2002.

[10] C. Sun. Fast stereo matching using rectangular subregioning and 3D maximum-surface techniques. CVPR
2001 Stereo Workshop / IJCV 2002.

[11] M. Gong and Y.-H. Yang. Multi-baseline Stereo Matching Using Genetic Algorithm. CVPR 2001 Stereo
Workshop / IJCV 2002.

[12] S. Birchfield and C. Tomasi. Depth discontinuities by pixel-to-pixel stereo. ICCV 1998.

[13] S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. ICCV 1999.

[14] S. Roy and I. J. Cox. A maximum-flow formulation of the N-camera stereo correspondence problem. ICCV
1998.

[15] K. Mühlmann, D. Maier, J. Hesser, and R. Männer. Calculating Dense Disparity Maps from Color Stereo
Images, an Efficient Implementation. CVPR 2001 Stereo Workshop / IJCV 2002.

[16] M. Lin and C. Tomasi. Surfaces with Occlusions from Layered Stereo. Ph.D. thesis, Stanford University, 2002.

[17] O. Veksler. Fast Variable Window for Stereo Correspondence using Integral Images. CVPR 2003.

[18] M. Agrawal and L. Davis. Window Based, Discontinuity Preserving Stereo. Submitted to CVPR 2003.

[19] H. Mayer. Analysis of Means to Improve Cooperative Disparity Estimation. ISPRS Conf. on Photogrammetric
Image Analysis, 2003.

[20] J. Y. Goulermas and P. Liatsis. A Collective-based Adaptive Symbiotic Model for Surface Reconstruction in
Area-based Stereo. IEEE Trans. Evolutionary Computation, vol. 7(5), pp. 482-502, 2003.

[21] V. Kolmogorov and R. Zabih. Multi-camera Scene Reconstruction via Graph Cuts. ECCV 2002.

[22] (Withdrawn)

[23] L. Hong and G. Chen. Segment-Based Stereo Matching Using Graph Cuts. CVPR 2004.

[24] Y. Wei and L. Quan. Region-Based Progressive Stereo Matching. CVPR 2004.

[25] M. Bleyer and M. Gelautz. A layered stereo algorithm using image segmentation and global visibility
constraints. Submitted to ICIP 2004.